{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9a4cb825-0940-44a7-9f79-c1ca73b37906",
   "metadata": {},
   "source": [
    "# Notebook 3: Document Question-Answering with LlamaIndex\n",
    "This notebook demonstrates how to use [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) to build a chatbot that references a custom knowledge base. \n",
    "\n",
    "Suppose you have some text documents (PDF, blog, Notion pages, etc.) and want to ask questions related to the contents of those documents. LLMs, given their proficiency in understanding text, are a great tool for this. \n",
    "\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "    \n",
    "⚠️ The notebook before this one, `02_langchain_index_simple.ipynb`, contains the same functionality as this notebook but uses some LangChain components instead of LlamaIndex components. \n",
    "\n",
    "Concepts that are used in this notebook are explained in-depth in the previous notebook. If you are new to retrieval augmented generation, it is recommended to go through the previous notebook before this one. \n",
    "\n",
    "Ultimately, we recommend reading about LangChain vs. LlamaIndex and picking the software/components of the software that makes the most sense to you. This is discussed a bit further below. \n",
    "\n",
    "</div>\n",
    "\n",
    "### [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/)\n",
    "[**LlamaIndex**](https://gpt-index.readthedocs.io/en/stable/) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. Since LLMs are both only trained up to a fixed point in time and do not contain knowledge that is proprietary to an Enterprise, they can't answer questions about new or proprietary knowledge. LlamaIndex helps solve this problem by providing data connectors to ingest data, indices to structure data for storage, and engines to communicate with data. \n",
    "\n",
    "\n",
    "### [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) or [LangChain](https://python.langchain.com/docs/get_started/introduction)?\n",
    "\n",
    "It's recommended to read more about the unique strengths of both LlamaIndex and LangChain. At a high level, LangChain is a more general framework for building applications with LLMs. LangChain is (currently) more mature when it comes to multi-step chains and some other chat functionality such as conversational memory. LlamaIndex has plenty of overlap with LangChain, but is particularly strong for loading data from a wide variety of sources and indexing/querying tasks. \n",
    "\n",
    "Since LlamaIndex can be used *with* LangChain, the frameworks' unique capabilities can be leveraged together; the combination of the two is demonstrated in this notebook.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d76e8af7-2124-4cb6-8ade-e1c1c42d1701",
   "metadata": {},
   "source": [
    "### Step 1: Integrate TensorRT-LLM to LangChain *and* LlamaIndex\n",
    "#### Customized LangChain LLM in LlamaIndex\n",
    "Langchain allows you to create custom wrappers for your LLM in case you want to use your own LLM or a different wrapper than the one that is supported in LangChain. Since we are using LlamaIndex, we have written a custom langchain wrapper compatible with LlamaIndex.\n",
    "\n",
    "We can easily take a custom LLM that has been wrapped for LangChain and plug it into [LlamaIndex as an LLM](https://docs.llamaindex.ai/en/stable/understanding/using_llms/using_llms.html#using-llms)! We use the [LlamaIndex LangChainLLM library](https://gpt-index.readthedocs.io/en/latest/api_reference/llms/langchain.html) so the LangChain LLM can be used in LlamaIndex. \n",
    "\n",
    "<div class=\"alert alert-block alert-warning\">\n",
    "    \n",
    "<b>WARNING!</b> Be sure to replace `server_url` with the address and port that Triton is running on. \n",
    "\n",
    "</div>\n",
    "\n",
    "Use the address and port that the Triton is available on; for example `localhost:8001`. **If you are running this notebook as part of the generative ai workflow, your can use the existing url.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8a80987e-1ddb-4248-b76c-f3ce16745ca3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from triton_trt_llm import TensorRTLLM\n",
    "from llama_index.llms import LangChainLLM\n",
    "trtllm =TensorRTLLM(server_url =\"llm:8001\", model_name=\"ensemble\", tokens=500)\n",
    "llm = LangChainLLM(llm=trtllm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc57b68d-afd5-4a0c-832c-0ad8f3f475d5",
   "metadata": {},
   "source": [
    "### Step 2: Create a Prompt Template\n",
    "\n",
    "A [**prompt template**](https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/prompts.html) is a common paradigm in LLM development.\n",
    "\n",
    "They are a pre-defined set of instructions provided to the LLM and guide the output produced by the model. They can contain few shot examples and guidance and are a quick way to engineer the responses from the LLM. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `LLAMA_PROMPT_TEMPLATE`, which we manipulate to be constructed with:\n",
    "- The system prompt\n",
    "- The context\n",
    "- The user's question\n",
    "  \n",
    "Much like LangChain's abstraction of prompts, LlamaIndex has similar abstractions for you to create prompts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "682ec812-33be-430f-8bb1-ae3d68690198",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import Prompt\n",
    "\n",
    "LLAMA_PROMPT_TEMPLATE = (\n",
    " \"<s>[INST] <<SYS>>\"\n",
    " \"Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer.\"\n",
    " \"<</SYS>>\"\n",
    " \"<s>[INST] Context: {context_str} Question: {query_str} Only return the helpful answer below and nothing else. Helpful answer:[/INST]\"\n",
    ")\n",
    "\n",
    "qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "056850b3-70c6-438a-9c35-e017ab611252",
   "metadata": {},
   "source": [
    "### Step 3: Load Documents\n",
    "\n",
    "<div>\n",
    "<img src=\"./imgs/llama_hub.png\" width=\"500\"/>\n",
    "</div>\n",
    "\n",
    "LlamaIndex provides [**data loaders**](https://docs.llamaindex.ai/en/stable/module_guides/loading/connector/root.html#data-connectors-llamahub) through [Llama Hub](https://llamahub.ai/). These allow for custom data sources to be connected to your LLM via plugins. So, for example, plugins are available to load documents from [Jira](https://llamahub.ai/l/jira), [Outlook Calendar](https://llamahub.ai/l/outlook_localcalendar), [Slack](https://llamahub.ai/l/slack), [Trello](https://llamahub.ai/l/trello), and many other applications. \n",
    "\n",
    "At the core of each data loader is a `download_loader` function which downloads the loader file into a module that you can use in your application. Once the loader is downloaded, data is ingested through the loader. The output of this ingestion is data formatted as a LlamaIndex [**Document**](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/root.html#documents-nodes) (text and metadata). \n",
    "\n",
    "Similar to the previous notebook with LangChain, an [`UnstructuredReader`](https://llamahub.ai/l/file-unstructured) is used in this example. However, this time it's from from [Llama Hub](https://llamahub.ai/) (LlamaIndex). Again, we load a research paper about Llama2 from Meta. \n",
    "\n",
    "[Here](https://python.langchain.com/docs/integrations/document_loaders) are some of the other document loaders available from LangChain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e9457012-e436-4371-9157-56c1ce4be667",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File ‘llama2_paper.pdf’ already there; not retrieving.\n"
     ]
    }
   ],
   "source": [
    "! wget -O \"llama2_paper.pdf\" -nc --user-agent=\"Mozilla\" https://arxiv.org/pdf/2307.09288.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "4f9adbc8-2060-4b16-9252-ac6727b862ee",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     /root/nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
      "[nltk_data]       date!\n"
     ]
    }
   ],
   "source": [
    "from llama_hub.file.unstructured.base import UnstructuredReader\n",
    "import time\n",
    "\n",
    "loader = UnstructuredReader()\n",
    "start_time = time.time()\n",
    "documents = loader.load_data(file=\"llama2_paper.pdf\")\n",
    "print(f\"--- {time.time() - start_time} seconds ---\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f03d6d82-8157-4dbc-97dd-29e3b990f8aa",
   "metadata": {},
   "source": [
    "### Step 4: Transform Documents with Text Splitting and a Node Parser\n",
    "#### a) Generate Embeddings \n",
    "Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, for example, a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/). \n",
    "\n",
    "This is the same process as the previous notebook; again, we use a LangChain text splitter. In this example, we use a [``SentenceTransformersTokenTextSplitter``](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.SentenceTransformersTokenTextSplitter.html#langchain.text_splitter.SentenceTransformersTokenTextSplitter). The ``SentenceTransformersTokenTextSplitter`` is a specialized text splitter for use with the sentence-transformer models. The default behavior is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use. This sentence transformer model is used to generate the embeddings from documents.\n",
    "\n",
    "There are some nuanced complexities to text splitting since semantically related text, in theory, should be kept together. \n",
    "\n",
    "To use the Langchain's `SentenceTransformersTokenTextSplitter` with LlamaIndex we use the [**Langchain node parser**](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#langchainnodeparser) on top of the text splitter from LangChain. This is not required, but since LlamaIndex provides a [**node structure**](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/root.html#documents-nodes), we choose to use this functionality to level up our storage of documents. \n",
    "\n",
    "**Nodes** represent chunks of source documents, but they also contain metadata and relationship information with other nodes and index structures. Since nodes provide these additional forms of hierarchy and connections across the data, they can help generate more accurate answers upon retrieval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fa366250-108e-45a0-88ce-e6f7274da8e1",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "from langchain.text_splitter import SentenceTransformersTokenTextSplitter\n",
    "from llama_index.node_parser import LangchainNodeParser\n",
    "\n",
    "\n",
    "TEXT_SPLITTER_MODEL = \"intfloat/e5-large-v2\"\n",
    "TEXT_SPLITTER_TOKENS_PER_CHUNK = 510\n",
    "TEXT_SPLITTER_CHUNCK_OVERLAP = 200\n",
    "\n",
    "text_splitter = SentenceTransformersTokenTextSplitter(\n",
    "    model_name=TEXT_SPLITTER_MODEL,\n",
    "    tokens_per_chunk=TEXT_SPLITTER_TOKENS_PER_CHUNK,\n",
    "    chunk_overlap=TEXT_SPLITTER_CHUNCK_OVERLAP,\n",
    ")\n",
    "\n",
    "node_parser = LangchainNodeParser(text_splitter)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf9e2595-ae85-4c00-b561-d7d1a40933bf",
   "metadata": {},
   "source": [
    "Additionally, we use a LlamaIndex [``PromptHelper``](https://gpt-index.readthedocs.io/en/latest/api_reference/service_context/prompt_helper.html) to help deal with LLM context window token limitations. It calculates available context size to the LLM by taking the initial context token length and subtracting out reserved token space for the prompt template and output. It provides a utility for re-packing text chunks from the index to maximally use the context window to minimize requests sent to the LLM.\n",
    "\n",
    "- ``context_window``: context window for the LLM -- the context length for Llama2 is 4k tokens\n",
    "- ``num_ouptut``: number of output tokens for the LLM\n",
    "- ``chunk_overlap_ratio``: chunk overlap as a ratio to chunk size\n",
    "- ``chunk_size_limit``: maximum chunk size to use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "dc9a6082-34a0-4aa7-964b-7fe3f2015aa9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import PromptHelper\n",
    "\n",
    "prompt_helper = PromptHelper(\n",
    "  context_window=4096,\n",
    "  num_output=256,\n",
    "  chunk_overlap_ratio=0.1,\n",
    "  chunk_size_limit=None\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8dab583-a12d-4fb1-a9eb-3a1b1f04075d",
   "metadata": {},
   "source": [
    "### Step 5: Generate and Store Embeddings\n",
    "#### a) Generate Embeddings \n",
    "[Embeddings](https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#embeddings) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. This allows you to quickly and efficiently find other pieces of text that are similar. \n",
    "\n",
    "When a user sends in their query, the query is also embedded using the same embedding model that was used to embed the documents. As explained earlier, this allows us to find similar (relevant) documents to the user's query. \n",
    "\n",
    "Like other sections in this notebook, we can easily take a LangChain embedding object and use with LlamaIndex. We use the [LangchainEmbedding library](https://docs.llamaindex.ai/en/stable/api_reference/service_context/embeddings.html#langchainembedding), which acts as a wrapper around Langchain's embedding models. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e9011ba0-f3f6-41f0-8a15-48f264743545",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from llama_index.embeddings import LangchainEmbedding\n",
    "\n",
    "#Running the model on CPU as we want to conserve gpu memory.\n",
    "#In the production deployment (API server shown as part of the 5th notebook we run the model on GPU)\n",
    "model_name=\"intfloat/e5-large-v2\"\n",
    "model_kwargs = {\"device\": \"cpu\"}\n",
    "encode_kwargs = {\"normalize_embeddings\": False}\n",
    "hf_embeddings = HuggingFaceEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs,\n",
    ")\n",
    "# Load in a specific embedding model\n",
    "embed_model = LangchainEmbedding(hf_embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8db99124-e438-406d-880d-557501a461d3",
   "metadata": {},
   "source": [
    "#### b) Store Embeddings \n",
    "\n",
    "LlamaIndex provides a supporting module, [`ServiceContext`](https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context.html#servicecontext), to bundle commonly used resources during the indexing and querying stage. In this example, we bundle resources we've built: the LLM, the embedding model, the node parser, and the prompt helper.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "0e493f9d-589a-4820-902d-f68932bfb0d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import ServiceContext\n",
    "service_context = ServiceContext.from_defaults(\n",
    "  llm=llm,\n",
    "  embed_model=embed_model,\n",
    "  node_parser=node_parser,\n",
    "  prompt_helper=prompt_helper\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d339a5b9-0d76-43e7-86d7-0f544f0805a2",
   "metadata": {},
   "source": [
    "Set the service context globally, to avoid passing it to every llm call/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ba0efae7-a8ad-4db0-80ea-7edd69bf4719",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import set_global_service_context\n",
    "set_global_service_context(service_context)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "79c7923c-d778-4f32-be37-4314063ecd2f",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "    \n",
    "⚠️ in the deployment of this workflow, [Milvus](https://milvus.io/) is running as a vector database microservice.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "1e94e53e-41a9-47d3-a9d3-7c0af4c07f76",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index import VectorStoreIndex\n",
    "from llama_index.storage.storage_context import StorageContext\n",
    "from llama_index.vector_stores import MilvusVectorStore\n",
    "\n",
    "vector_store = MilvusVectorStore(uri=\"http://milvus:19530\", dim=1024, overwrite=False)\n",
    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
    "index = VectorStoreIndex.from_vector_store(vector_store)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5783e23b",
   "metadata": {},
   "source": [
    "Let's load the documents into the vector database index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "474b8820",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "start_time = time.time()\n",
    "nodes = node_parser.get_nodes_from_documents(documents)\n",
    "index.insert_nodes(nodes)\n",
    "print(f\"--- {time.time() - start_time} seconds ---\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57e7aa7f-a219-44fe-8757-432daf278f6a",
   "metadata": {},
   "source": [
    "### Step 6: Build the Query Engine and Stream Response\n",
    "\n",
    "#### a) Build the Query Engine\n",
    "\n",
    "A query engine is an object that takes in a query and returns a response. Each vector index has a default corresponding query engine; for example, the default query engine for a vector index performs a standard top-k retrieval over the vector store.\n",
    "\n",
    "A query engine contains the following components:\n",
    "- Retriever\n",
    "- Node PostProcessor\n",
    "- Response Synthesizer "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f56f37e0-341e-4d7d-b282-f374a16f55b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine = index.as_query_engine(text_qa_template=qa_template, streaming=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2359014-ef1f-4d0f-bac9-8fdd37a93351",
   "metadata": {},
   "source": [
    "#### b) Stream a Response from the Query Engine\n",
    "Lastly, we pass the query engine a user's question and stream the response. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38d23754-ea6b-47ce-8b3b-ebd37c0f5693",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "start_time = time.time()\n",
    "response = query_engine.query(\"what is the context length of llama2?\")\n",
    "response.print_response_stream()\n",
    "print(f\"\\n--- {time.time() - start_time} seconds ---\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
